Netflix Data Analysis Project - Advanced

This is a project made in the context of the Tech Academy R for Data Science Advanced course at the Goethe University Frankfurt in 2022/2023. We had to use a general Netflix dataset from Kaggle in addition to our own data and go through a series of analyses, as you’ll see below.

4.1 Getting started

Import libraries

library(tidyverse)
library(lubridate)
library(plotly)
library(ggExtra)
library(wordcloud)

Load dataset

I then set my working directory and load the dataset Netflix Movies and TV Shows. I’m working with Version 5, last updated on September 27th, 2021.

# Set directory
setwd("D:/R_TechAcademy")

# Load general Netflix data
netflix_general <- read_csv("netflix_titles.csv")

4.1.1 Discovering the Data

First, we need to get an overview of the data.

glimpse(netflix_general)
## Rows: 8,807
## Columns: 12
## $ show_id      <chr> "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s1…
## $ type         <chr> "Movie", "TV Show", "TV Show", "TV Show", "TV Show", "TV …
## $ title        <chr> "Dick Johnson Is Dead", "Blood & Water", "Ganglands", "Ja…
## $ director     <chr> "Kirsten Johnson", NA, "Julien Leclercq", NA, NA, "Mike F…
## $ cast         <chr> NA, "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Mola…
## $ country      <chr> "United States", "South Africa", NA, NA, "India", NA, NA,…
## $ date_added   <chr> "September 25, 2021", "September 24, 2021", "September 24…
## $ release_year <dbl> 2020, 2021, 2021, 2021, 2021, 2021, 2021, 1993, 2021, 202…
## $ rating       <chr> "PG-13", "TV-MA", "TV-MA", "TV-MA", "TV-MA", "TV-MA", "PG…
## $ duration     <chr> "90 min", "2 Seasons", "1 Season", "1 Season", "2 Seasons…
## $ listed_in    <chr> "Documentaries", "International TV Shows, TV Dramas, TV M…
## $ description  <chr> "As her father nears the end of his life, filmmaker Kirst…
summary(netflix_general)
##    show_id              type              title             director        
##  Length:8807        Length:8807        Length:8807        Length:8807       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      cast             country           date_added         release_year 
##  Length:8807        Length:8807        Length:8807        Min.   :1925  
##  Class :character   Class :character   Class :character   1st Qu.:2013  
##  Mode  :character   Mode  :character   Mode  :character   Median :2017  
##                                                           Mean   :2014  
##                                                           3rd Qu.:2019  
##                                                           Max.   :2021  
##     rating            duration          listed_in         description       
##  Length:8807        Length:8807        Length:8807        Length:8807       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
## 

All columns are of class character, except release_year, which is a double (numeric). The TechAcademy Leitfaden asks us to turn release_year into a date, though I found that to be a waste of time: even after converting with the lubridate library and extracting the year, we eventually end up back at a double, since a bare year (yyyy) is not actually a complete date format. Here is the code that I used (do correct me if I’m wrong!):

# Convert release_year to date format
netflix_general$release_year <- year(as.Date(as.character(netflix_general$release_year), format = '%Y'))
# Check to see if that worked
class(netflix_general$release_year)
## [1] "numeric"
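For completeness, here is a minimal base-R sketch (not part of the Leitfaden’s requirements) showing one way to get a genuine Date class out of a bare year: pin every release to January 1st. The trade-off is an invented month and day, which is why leaving the column numeric is arguably cleaner.

```r
# A bare year has no month or day, so '%Y' alone cannot produce a full Date.
# Workaround: attach a fixed (invented) January 1st to each year.
release_year <- c(2020, 1993)                      # toy values
release_date <- as.Date(paste0(release_year, "-01-01"))
class(release_date)                                # "Date"
release_date
```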

We should make “date_added” a date. Here it works well.

# Convert date_added to date format
netflix_general$date_added <- mdy(netflix_general$date_added)
# Check to see if that worked
class(netflix_general$date_added)
## [1] "Date"

We should also make “duration” numeric. This is a bit more complicated, because the original dataset mixes duration in minutes for movies with duration in seasons for TV shows. We should separate the two into different columns, “duration_movie” and “duration_season_number”, leaving the cells for the non-matching category as NA.

netflix_general <- netflix_general %>%
  # strip the text units so only the numbers remain
  mutate(duration_movie = as.numeric(gsub(" min", "", ifelse(type == "Movie", duration, NA))),
         duration_season_number = as.numeric(gsub(" Seasons?", "", ifelse(type == "TV Show", duration, NA))))
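An alternative sketch, assuming the readr package (installed with the tidyverse) is available: readr::parse_number() extracts the first number from a string, which handles both “90 min” and “2 Seasons” without hand-written regexes.

```r
library(readr)

# parse_number() discards the non-numeric characters around the first number
parse_number(c("90 min", "2 Seasons", "1 Season"))
```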

Let’s remove the “duration” column to avoid confusion later on.

netflix_general <- select(netflix_general, -duration)

4.1.2 Give some overall statements

Longest movie

What’s the longest movie (not TV show) included in the dataset?

print(netflix_general %>% slice_max(duration_movie))
## # A tibble: 1 × 13
##   show_id type  title    direc…¹ cast  country date_added relea…² rating liste…³
##   <chr>   <chr> <chr>    <chr>   <chr> <chr>   <date>       <dbl> <chr>  <chr>  
## 1 s4254   Movie Black M… <NA>    Fion… United… 2018-12-28    2018 TV-MA  Dramas…
## # … with 3 more variables: description <chr>, duration_movie <dbl>,
## #   duration_season_number <dbl>, and abbreviated variable names ¹​director,
## #   ²​release_year, ³​listed_in

The film is called Black Mirror: Bandersnatch at 312 minutes. This is actually an interesting case, since this film is a “choose your own adventure” special episode of the anthology series Black Mirror. Since it is interactive, there is no set amount of time the film takes, though Netflix themselves state that the run time for default choices is only 90 minutes. This shows how individual cases can often defy simple characterizations like “duration”. The 312 minutes that we see here are likely the cumulative duration of all available scenes, though a single screening of the film could never be that long.

Most represented country

Which country has the most content (movies and TV shows) featured on Netflix?

print(netflix_general %>% 
  count(country) %>%
  slice_max(n))
## # A tibble: 1 × 2
##   country           n
##   <chr>         <int>
## 1 United States  2818

The answer is not surprising - the United States, with 2,818 titles in the dataset.

Overall numbers: movies and TV

How many movies and TV shows are included? Let’s make a bar chart:

ggplot(netflix_general, aes(x = type)) + geom_bar() + 
  labs(x = NULL, y = NULL, title = "Number of Media on Netflix by Type")

4.2 Data Cleaning and Useful Transformations

4.2.1 Date Formatting

I had already done the suggested cleanups (conversion of “date_added” and “release_year” into date format) above in section 4.1.

4.2.2 Mean and standard deviation of movie durations

What are the mean and standard deviation of movie durations in minutes?

# mean
print(duration_mean <- mean(netflix_general$duration_movie, na.rm=T))
## [1] 99.57719
# standard deviation
print(duration_sd <- sd(netflix_general$duration_movie, na.rm=T))
## [1] 28.29059

4.2.3 A bar chart of the top 10 longest movie durations

First, let’s take a look at the 10 longest films on Netflix:

top_ten_length <- netflix_general %>%
  slice_max(duration_movie, n = 10)
print(top_ten_length)
## # A tibble: 10 × 13
##    show_id type  title   direc…¹ cast  country date_added relea…² rating liste…³
##    <chr>   <chr> <chr>   <chr>   <chr> <chr>   <date>       <dbl> <chr>  <chr>  
##  1 s4254   Movie Black … <NA>    Fion… United… 2018-12-28    2018 TV-MA  Dramas…
##  2 s718    Movie Headsp… <NA>    Andy… <NA>    2021-06-15    2021 TV-G   Docume…
##  3 s2492   Movie The Sc… Houssa… Suha… Egypt   2020-05-21    1973 TV-14  Comedi…
##  4 s2488   Movie No Lon… Samir … Said… Egypt   2020-05-21    1979 TV-14  Comedi…
##  5 s2485   Movie Lock Y… Fouad … Foua… <NA>    2020-05-21    1982 TV-PG  Comedi…
##  6 s2489   Movie Raya a… Hussei… Suha… <NA>    2020-05-21    1984 TV-14  Comedi…
##  7 s167    Movie Once U… Sergio… Robe… Italy,… 2021-09-01    1984 R      Classi…
##  8 s7933   Movie Sangam  Raj Ka… Raj … India   2019-12-31    1964 TV-14  Classi…
##  9 s1020   Movie Lagaan  Ashuto… Aami… India,… 2021-04-17    2001 PG     Dramas…
## 10 s4574   Movie Jodhaa… Ashuto… Hrit… India   2018-10-01    2008 TV-14  Action…
## # … with 3 more variables: description <chr>, duration_movie <dbl>,
## #   duration_season_number <dbl>, and abbreviated variable names ¹​director,
## #   ²​release_year, ³​listed_in

Now, let’s turn this into a bar chart (while the Leitfaden calls this a histogram, it’s actually a bar chart since the data is discrete - each film is a separate entity with its own length - hence the gaps between each bar, which do not exist in a histogram).

ggplot(top_ten_length, aes(x = duration_movie, y = reorder(title, duration_movie))) + 
  geom_col() +
  labs(x = "Duration (min)", y = NULL, title = "The 10 Longest Movies on Netflix")

4.2.4 Visualizing average movie durations over time

The Leitfaden suggests at this point that we analyze how the average movie length evolved with a graph.

ggplot(netflix_general, aes(x = release_year, y = duration_movie)) + 
  geom_line(stat = "summary", fun = "mean")

We were asked to comment and interpret the graph - “were there any significant increases/decreases in movie length over time? If so, what could be the reason?”

I actually don’t believe this graph of averages can tell us much, because the means hide how many films they are based on per year. So I made a scatter plot with each film’s title as hover text to better inspect the data. (You can count this as my “surprise us” plot for 4.8.)

scatter_length <- ggplot(netflix_general, aes(x = release_year, y = duration_movie, text=title)) +  
  geom_point() +
  labs(x = "Release Year", y = "Duration (min)", title = "Netflix's movie durations over time")
ggplotly(scatter_length)

After actually checking the films in question, I believe this is just a quirk of the data and cannot tell us anything about actual trends in the length of films produced over the decades. The Netflix dataset has few movies from the 1940s to the 1980s, so the lengths of individual films from this period skew the results. A cluster of World War 2 documentaries running around 40 minutes drags the average down in the 1940s, while in the 1960s a few epics like Doctor Zhivago push durations up; neither is representative of the average duration of films made in those periods. In the last few decades, quite a few “movies” that aren’t actually feature films - Power Rangers specials, comedy specials, and special features (e.g. Creating the Queen’s Gambit) - also bring the average down.

4.3 Your personal data

In this portion, each participant used a data set based on their own viewing activity, which they requested from Netflix.

4.3.1 Load your data

my_data <- read_csv("D:/R_TechAcademy/NetflixReport/CONTENT_INTERACTION/ViewingActivity.csv")

Let’s see what it looks like:

glimpse(my_data)
## Rows: 17,161
## Columns: 10
## $ `Profile Name`            <chr> "Isa", "Isa", "Isa", "Isa", "Isa", "Isa", "I…
## $ `Start Time`              <dttm> 2022-06-27 19:02:15, 2022-06-27 18:56:05, 2…
## $ Duration                  <time> 01:29:35, 00:00:01, 01:29:21, 00:00:02, 00:…
## $ Attributes                <chr> NA, "Autoplayed: user action: User_Interacti…
## $ Title                     <chr> "Wallander: Series 1: Firewall (Episode 2)",…
## $ `Supplemental Video Type` <chr> NA, NA, NA, NA, NA, "TEASER_TRAILER", NA, NA…
## $ `Device Type`             <chr> "Amazon Fire TV Stick 2020 Lite Streaming St…
## $ Bookmark                  <time> 01:27:50, 00:00:01, 01:29:02, 00:00:02, 00:…
## $ `Latest Bookmark`         <chr> "01:27:50", "Not latest view", "01:29:02", "…
## $ Country                   <chr> "DE (Germany)", "DE (Germany)", "DE (Germany…

The data types are better than in netflix_general, but there are still some issues we need to solve.

4.3.2 Clean and transform dataset

The cleaning here involves several steps. First, we remove everything that isn’t an actual film or episode by keeping only the rows where “Supplemental Video Type” is NA; that column flags trailers, recaps, special features, and the like. Then we run into the issue that the column “Title” names TV show episodes individually, making it too granular and incompatible with the larger Netflix dataset, where series appear under a single title.

Since we will want to merge the two datasets, their title columns need to match. We therefore split “Title” into three columns: ‘title’ (lower-case, as in ‘netflix_general’), ‘season’ and ‘episode_title’. This information is separated by colons (:), but we can’t naively split on every colon, because some films have colons in their names. So after splitting, we check whether the ‘season’ piece contains words like “Season” (and equivalents) and whether the ‘episode_title’ piece contains words like “Episode”; when neither does, we glue the pieces back onto the title.

my_data <- my_data %>%
  #filter out supplemental videos
  filter(is.na(`Supplemental Video Type`)) %>% 
  #separate titles for TV show episodes
  separate(col=Title, into=c("title", "season", "episode_title"), 
           sep=': ', remove=TRUE) %>% 
  mutate(title=ifelse(grepl("Season", season) | 
                        grepl("Series", season) | 
                        grepl("Staffel", season) |
                        grepl("Episode", episode_title) | 
                        grepl("Chapter", episode_title) |
                        is.na(season),  title, paste(title, season, sep =": ")),
         # Parse "Start Time" as a proper date-time
         `Start Time`= ymd_hms(`Start Time`),
         # Split into columns for date, month, weekday and start time, which we'll need later
         viewing_date = date(`Start Time`),
         viewing_month = month(`Start Time`, label=TRUE),
         viewing_year = year(`Start Time`),
         viewing_weekday = wday(`Start Time`, label=TRUE),
         start_time = hms(format(as.POSIXct(`Start Time`), format = "%H:%M:%S")))%>%
  # Rename "Duration" column to watch_time to avoid confusion
  rename(watch_time = Duration)

The TechAcademy Leitfaden says here that “Netflix recorded every time you clicked on a movie even if you didn’t watch it. Check which column indicates those with a specific value.” I imagine this refers to the column “Attributes”, which marks whether a film was autoplayed, but I don’t think this matters much: an autoplayed film might still have been watched. I would agree that rows with a very short watch_time might be better off removed to avoid bias, but looking at the data, these often come from pauses, and the cumulative watch_time should include those short bursts. So I’ll keep them in for now and filter them out only when needed.
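If one did want to drop pure auto-plays, here is a hypothetical base-R sketch. It assumes (based on the glimpse above) that the “Attributes” strings look like “Autoplayed: user action: None;” when the viewer never interacted - an assumption worth verifying against your own export.

```r
toy <- data.frame(
  Title      = c("Wallander: Series 1", "Some auto-played teaser"),
  Attributes = c(NA, "Autoplayed: user action: None;")
)
# keep rows that were not auto-played, or were auto-played but then
# interacted with (i.e. the user action is not "None")
keep <- is.na(toy$Attributes) | !grepl("user action: None", toy$Attributes)
toy[keep, ]
```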

Before we join the datasets, let’s remove the columns we don’t want from my_data so the joined data isn’t unnecessarily large.

my_data <- select(my_data, -c("Start Time", "Attributes", "Supplemental Video Type", "Device Type", "Bookmark", "Latest Bookmark", "Country"))

The only remaining column with a space in its name is “Profile Name”. Let’s change that, because the space is kind of annoying and sometimes causes problems with selection.

my_data <- my_data %>% 
  rename(profile_name = "Profile Name")

4.3.3 Join datasets

Now let’s join! Both datasets have a column called “title” which is reasonably clean now, so let’s use that.

netflix_combined <- my_data %>% 
  left_join(netflix_general, by = "title")
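One optional sanity check (a sketch on toy data, assuming dplyr’s anti_join() - not something the Leitfaden asks for): a left join silently fills NA for titles that never matched the Kaggle catalogue, so it is worth counting how many of those there are.

```r
library(dplyr)

watched   <- tibble(title = c("Dark", "Some Local Recording"))
catalogue <- tibble(title = "Dark", type = "TV Show")

# titles from the viewing history with no match in the catalogue
unmatched <- anti_join(watched, catalogue, by = "title")
unmatched
```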

4.3.5 Dynamic Interactive line plot

Our goal for this task is to plot how each viewer’s activity developed over time. First, we group the watch times per day for each viewer (this matters if you have several viewers on your account, as I do):

by_date <- netflix_combined %>% 
  group_by(profile_name, viewing_date) %>% 
  mutate(watchtime_per_day = as.period(sum(watch_time)))

TechAcademy recommended a dynamic chart here, since a static plot would be a bit unclear; I figured it would be clearer still as an interactive plot:

per_day_plot <- ggplot(by_date, aes(y = watchtime_per_day, x=viewing_date, color=profile_name))+
  geom_point()+
  geom_line()+
  scale_y_time(name = "Watch time per day")+
  scale_x_date(date_breaks = "9 month", date_labels = "%Y-%m")+ 
  theme(axis.text.x = element_text(angle= 45, hjust=1))
# Make it interactive  
ggplotly(per_day_plot)

My mother, Glaucia, seems to be the big binge watcher in the family! She has some extraordinary bursts of activity (17 hours and 34 minutes on May 30th, 2021!) that are almost suspiciously long. I’ll investigate that in a bit. The other users’ binge-watching activity is more normal, with peaks around 8 hours in a day. Glaucia, Caio and Isa (that’s me) all started using the Netflix account in 2017, but Samuel only started in April 2020 - perhaps a pandemic-related change in viewing habits?

netflix_combined %>% 
  filter(profile_name == "glaucia") %>% 
  filter(viewing_date == "2021-05-30")
## # A tibble: 44 × 22
##    profile_name watch_…¹ title season episo…² viewing_…³ viewi…⁴ viewi…⁵ viewi…⁶
##    <chr>        <time>   <chr> <chr>  <chr>   <date>     <ord>     <dbl> <ord>  
##  1 glaucia      22'36"   Sam … Tempo… #Assis… 2021-05-30 May        2021 Sun    
##  2 glaucia      22'51"   iCar… Tempo… Sonhos… 2021-05-30 May        2021 Sun    
##  3 glaucia      23'38"   iCar… Tempo… Não qu… 2021-05-30 May        2021 Sun    
##  4 glaucia      23'37"   iCar… Tempo… iHatch… 2021-05-30 May        2021 Sun    
##  5 glaucia      23'37"   iCar… Tempo… Sou su… 2021-05-30 May        2021 Sun    
##  6 glaucia      23'37"   iCar… Tempo… Coraçã… 2021-05-30 May        2021 Sun    
##  7 glaucia      23'41"   iCar… Tempo… A namo… 2021-05-30 May        2021 Sun    
##  8 glaucia      23'37"   iCar… Tempo… iWant … 2021-05-30 May        2021 Sun    
##  9 glaucia      23'41"   iCar… Tempo… Espion… 2021-05-30 May        2021 Sun    
## 10 glaucia      23'38"   iCar… Tempo… Quero … 2021-05-30 May        2021 Sun    
## # … with 34 more rows, 13 more variables: start_time <Period>, show_id <chr>,
## #   type <chr>, director <chr>, cast <chr>, country <chr>, date_added <date>,
## #   release_year <dbl>, rating <chr>, listed_in <chr>, description <chr>,
## #   duration_movie <dbl>, duration_season_number <dbl>, and abbreviated
## #   variable names ¹​watch_time, ²​episode_title, ³​viewing_date, ⁴​viewing_month,
## #   ⁵​viewing_year, ⁶​viewing_weekday

Ok, there’s definitely something weird going on; I doubt my mom is going on hours-long binges of iCarly! I asked her about this, and the only explanation she can think of is that someone she certainly never approved of is using her account. This suspicious activity continued for months, though it seems to have died down in mid-2022. We should change our password just the same.

4.4 Let’s get personal

Next, we’ll investigate my own viewing habits. What’s the longest movie I have ever watched on Netflix?

isa_longest <- netflix_combined %>%
  filter(profile_name == "Isa") %>%
  slice_max(duration_movie)
print(isa_longest)
## # A tibble: 3 × 22
##   profile_name watch_t…¹ title season episo…² viewing_…³ viewi…⁴ viewi…⁵ viewi…⁶
##   <chr>        <time>    <chr> <chr>  <chr>   <date>     <ord>     <dbl> <ord>  
## 1 Isa          14'03"    The … <NA>   <NA>    2019-12-17 Dec        2019 Tue    
## 2 Isa          49'56"    The … <NA>   <NA>    2019-12-01 Dec        2019 Sun    
## 3 Isa          23'38"    The … <NA>   <NA>    2019-11-30 Nov        2019 Sat    
## # … with 13 more variables: start_time <Period>, show_id <chr>, type <chr>,
## #   director <chr>, cast <chr>, country <chr>, date_added <date>,
## #   release_year <dbl>, rating <chr>, listed_in <chr>, description <chr>,
## #   duration_movie <dbl>, duration_season_number <dbl>, and abbreviated
## #   variable names ¹​watch_time, ²​episode_title, ³​viewing_date, ⁴​viewing_month,
## #   ⁵​viewing_year, ⁶​viewing_weekday

I’m embarrassed to say I tried getting through The Irishman on three different days, and in the end I never finished the movie. It was too long…

4.4.1 Monthly viewing time in 2021

Now, we are supposed to analyze how my viewing time has changed throughout one year, 2021. First, let’s filter for my profile and the year 2021, then group by month and add the watch time up:

isa_2021_month_watchtime <- netflix_combined %>%
  filter(profile_name == "Isa" & viewing_year == 2021) %>%
  group_by(viewing_month) %>% 
  mutate(watchtime_per_month = sum(watch_time))

Now let’s select just the columns we need - the month and watch time per month - and remove duplicates:

isa_2021_month_watchtime <- select(isa_2021_month_watchtime, c("viewing_month", "watchtime_per_month"))
isa_2021_month_watchtime <- unique(isa_2021_month_watchtime)
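As an aside, the mutate() + select() + unique() combination can be collapsed into a single summarise() call, which returns one row per group directly. A sketch of the same logic on made-up toy data:

```r
library(dplyr)

toy <- tibble(
  viewing_month = c("Jan", "Jan", "Feb"),
  watch_time    = c(100, 200, 50)    # made-up values in seconds
)

# one row per month, with the summed watch time
monthly <- toy %>%
  group_by(viewing_month) %>%
  summarise(watchtime_per_month = sum(watch_time))
monthly
```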

Now let’s make a graph:

ggplot(isa_2021_month_watchtime, aes(x=viewing_month, y=watchtime_per_month)) + 
  geom_col() +
  labs(x = "Month", title = "Isa's Monthly Watch Times 2021") +
  scale_y_time(name = 'Watch time (hh:mm:ss)')

There is more variation than I was expecting. The dip in June and August could be explained as the summer months - better weather, more time outside - but the surge in July goes against that logic. I started a new job in April and was teaching from April to June, so that might contribute to the diminished watch times then.

4.4.2 Average per weekday

Now we want to analyze the viewing time on specific weekdays. On which days did I watch the most Netflix? Is there a peak?

We’ll use a similar approach as last time, but now we’ll group by weekday and take the mean instead of the sum:

isa_2021_weekday_watchtime <- netflix_combined %>%
  filter(profile_name =="Isa" & viewing_year == 2021) %>%
  group_by(viewing_weekday) %>% 
  mutate(watchtime_per_weekday = mean(watch_time))

Now let’s select just the columns we need - the weekday and the average watch time per weekday - and remove duplicates:

isa_2021_weekday_watchtime <- select(isa_2021_weekday_watchtime, c('viewing_weekday', 'watchtime_per_weekday'))
isa_2021_weekday_watchtime <- unique(isa_2021_weekday_watchtime)

Now let’s make a graph:

ggplot(isa_2021_weekday_watchtime, aes(x=viewing_weekday, y=watchtime_per_weekday)) + 
  geom_col() +
  scale_y_time(name='Watch time (hh:mm:ss)') +
  labs(x = "Weekday", title="Isa's Average Watch Times per Weekday 2021")

Saturday is second-to-last (after Monday) in my watch times, defying my preconceived notion that I’d watch more on the weekend - though Sunday is highest, as expected.

4.5 Binge watching

In this section, the goal is to create a plot of my top 10 binge TV shows. We first filter for TV shows and group by date and title, then sum the watch time of each title per day.

binge_TV <- netflix_combined %>%
  filter(profile_name == "Isa" & type == "TV Show") %>%
  group_by(title, viewing_date) %>% 
  mutate(watchtime_per_session = sum(watch_time))

Now let’s select just the columns we need - title, day and watch time per session - and remove duplicates:

binge_TV <- select(binge_TV, c("title", "viewing_date", "watchtime_per_session"))
binge_TV <- unique(binge_TV)

Let’s take a look at the data:

binge_TV[order(-binge_TV$watchtime_per_session),]
## # A tibble: 801 × 3
## # Groups:   title, viewing_date [801]
##    title                   viewing_date watchtime_per_session
##    <chr>                   <date>       <drtn>               
##  1 Stranger Things         2022-05-28   23472 secs           
##  2 Marvel's The Defenders  2017-08-27   23174 secs           
##  3 Orange Is the New Black 2019-08-08   19676 secs           
##  4 Squid Game              2021-10-15   18737 secs           
##  5 Suits                   2017-10-05   18289 secs           
##  6 Stranger Things         2017-11-01   17427 secs           
##  7 Better Call Saul        2022-05-29   17106 secs           
##  8 Stranger Things         2022-06-18   16934 secs           
##  9 Maniac                  2019-01-08   16613 secs           
## 10 Sex Education           2021-09-20   15863 secs           
## # … with 791 more rows

Sometimes the same TV show appears multiple times, since we are counting the top binge sessions here, not TV shows. There are several ways to define and analyze the “top 10 binge TV shows”. I’m going to filter out the watch sessions that were very short - following Netflix’s own practice, anything under 2 minutes. I will then group the TV shows by title and calculate the median watch time per title. To identify the shows that were most “binge-worthy” for me, the median watch time per session makes the most sense, as it reflects the typical session and isn’t swayed by outliers.

top_binge_TV <- binge_TV %>% 
  # cut-off for sessions not watched intentionally, as defined by Netflix (2 minutes)
  filter(watchtime_per_session > 120) %>%
  group_by(title) %>% 
  summarize(mean = mean(watchtime_per_session),
            sd = sd(watchtime_per_session),
            sum = sum(watchtime_per_session),
            median = median(watchtime_per_session))

Let’s see the top 10:

top10_binge_TV <- top_binge_TV %>%
  slice_max(median, n = 10)
print(top10_binge_TV)
## # A tibble: 10 × 5
##    title                          mean              sd sum         median    
##    <chr>                          <drtn>         <dbl> <drtn>      <drtn>    
##  1 Marvel's The Defenders         23174.000 secs   NA   23174 secs 23174 secs
##  2 Next in Fashion                 9662.667 secs 4232.  28988 secs  8660 secs
##  3 Stranger Things                 9199.200 secs 5797. 183984 secs  8311 secs
##  4 Bridgerton                      8361.143 secs 3644.  58528 secs  8303 secs
##  5 Halt and Catch Fire             9605.000 secs 5210.  28815 secs  7663 secs
##  6 Katla                           7831.000 secs 5507.  23493 secs  7630 secs
##  7 Spinning Out                    6840.500 secs 4042.  27362 secs  7408 secs
##  8 Chilling Adventures of Sabrina  7235.000 secs   NA    7235 secs  7235 secs
##  9 Invisible City                  7103.000 secs 3851.  14206 secs  7103 secs
## 10 The Hook Up Plan                7011.000 secs   NA    7011 secs  7011 secs

And here as a graph:

ggplot(top10_binge_TV, aes(x=reorder(title, median), y=median))+
  geom_col()+
  coord_flip()+
  scale_y_time(name='Median watch time (hh:mm:ss)') +
  labs(x="TV Show", title="Isa's Median Watch Times per TV Show Binge session")

I share my Netflix account with my boyfriend, and looking at this table makes me realize he is more of a binge watcher than I am. The number one position by a wide margin - Marvel’s The Defenders - is something he watched by himself, while we watched the second- and third-place Next in Fashion and Stranger Things together.

4.6 Scatterplot with marginal density

How has the watching behavior of my family developed since we first started using Netflix? We will visualize this via a scatterplot with marginal density, including all profile names for a visual comparison.

Let’s filter out the things watched for less than two minutes:

full_records <- netflix_combined %>% 
  filter(watch_time > 120)

And now let’s make a scatterplot:

scatter_marginal <- ggplot(data = full_records, 
       aes(x = viewing_date, y = watch_time)) +
  geom_point(aes(col = profile_name)) + 
  theme(legend.position = "bottom")

This is the code to add the marginal density:

print(ggMarginal(scatter_marginal, type = "density",
                   groupFill = TRUE,
                   groupColour = TRUE))

This plot is a bit too convoluted for my tastes and it’s kind of hard to interpret so many datapoints at once. Since Netflix records every time you click on something, sessions with fits and starts are recorded multiple times, creating a wall of points for some days. Most watch sessions are quite short - everything below an hour becomes a mass of points, and as we can see from the density plot on the right, they cluster in particular around the half-hour mark. From the density plot above, we confirm some of the information we already had, like the fact that Samuel only started watching using this account in 2020, but has been watching steadily since then.

4.7 Word cloud with your favorite genre

Let’s create a genre dataframe for my (Isa) genres:

isa_genres <- netflix_combined %>% 
  filter(profile_name == "Isa") %>% 
  select(listed_in) %>%
  separate_rows(listed_in, sep = ", ") %>%
  group_by(listed_in) %>% 
  summarize(freq=n())

And here’s the wordcloud:

#set seed so that wordcloud remains the same
set.seed(401)
#attempt with Wordcloud
wordcloud(words=isa_genres$listed_in, freq=isa_genres$freq, min.freq = 5, rot.per = 0.3,
                     max.words = 200, random.order = FALSE, colors = brewer.pal(6, "Dark2"))

4.8 Surprise Us!

I added an interactive scatterplot beforehand (see 4.2.4) that wasn’t asked for, so hopefully that covers the “surprise us” aspect :)

5 Content-Based Recommendation System

The machine learning part, in which we must create a content-based recommendation system using cosine similarity of plot, genre and actors, is due on February 5th, 2023.